Welcome to the Data Analysis II task. In this task we are going to work on how to handle missing data, as well as data scaling and normalization. Data can have missing values for a number of reasons, such as observations that were not collected or data corruption. We will discuss some general considerations for missing data, look at how pandas chooses to represent it, and demonstrate some built-in pandas tools for handling missing data in Python.
This should always be the first step of any analysis. Depending on the data source, missing data are identified differently. Pandas identifies missing values as NaN. However, unless the data has already been pre-processed, an analyst will not necessarily encounter missing values as NaN. Missing values can appear as a question mark (?), a zero (0), minus one (-1) or a blank. So it is important to view and understand your data before anything else.
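As a quick, standalone illustration (a minimal sketch; the placeholder values and column names here are made up, not taken from balance.txt), you can either tell pandas which placeholders to treat as missing while reading a file, or convert them to NaN afterwards:
import pandas as pd
import numpy as np
# option 1: treat '?' and -1 as missing while reading a file (the path here is hypothetical)
# df = pd.read_csv("some_file.csv", na_values=["?", -1])
# option 2: convert placeholders to NaN after the data has been loaded
raw = pd.DataFrame({"Age": [25, -1, 40], "City": ["Nairobi", "?", "Kigali"]})
cleaned = raw.replace({"?": np.nan, -1: np.nan})
cleaned.isnull().sum()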
We are going to use the dataset 'balance.txt'. From the first five observations, we can see there are several missing values, represented as NaN.
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.read_table("balance.txt", delim_whitespace=True)
df.head()
How many NaN values can you point out?
Now that we have seen that there are missing values, we need to get a sense of how many missing values exist in the whole dataset. We are going to find how many missing values there are in each column.
# get the number of missing data points per column
missing_values_count = df.isnull().sum()
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
To get an even better sense, we need to see what percentage of all the values are missing. If the percentage is very high, we may not be able to use the data.
# how many total missing values do we have?
total_cells = np.prod(df.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100
It seems like only about 1.77% of the values are missing.
One option is to remove any rows or columns that contain missing values. This strategy is not generally recommended for datasets with important/sensitive data or with a small number of observations. It is usually worth taking the time to go through your data and look closely at all the columns with missing values so you really understand your dataset.
If you're sure you want to drop rows with missing values, pandas has a handy function, dropna(), to help you do this. Let's try it out on our data.
# Create a temporary copy of the dataset
temp_df = df.copy()
# remove all the rows that contain a missing value.
temp_df.dropna().head()
As you can observe, observations 0, 1, 2 and 3 have been dropped from the dataframe because they contained NaN values.
The above procedure removed observations. Now let us look into removing columns.
# Create a temporary copy of the dataset
temp_df = df.copy()
# remove all columns with at least one missing value
columns_with_na_dropped = temp_df.dropna(axis=1)
columns_with_na_dropped.head()
It looks like all columns had at least one missing value except the Balance column. This is one of the reasons why dropping data for having NaNs may not be a very good solution. We'll explore other options.
We can use the pandas fillna() function to fill in missing values in a dataframe. One option we have is to specify what we want the NaN values to be replaced with. Here, we replace all NaNs with 0.
# Create a temporary copy of the dataset
temp_df = df.copy()
# replace all NA's with 0
temp_df.fillna(0).head()
Another way is to replace each missing value with the value that comes just before or just after it in the same column. This can be useful for datasets where the observations are in some sorted or logical order.
# Create a temporary copy of the dataset
temp_df = df.copy()
# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining NAs with 0
temp_df.fillna(method = 'bfill', axis=0).fillna(0).head()
Using measures of central tendency involves substituting the missing values with the mean or median for numerical variables and the mode for categorical variables. This imputation technique works well when the values are missing completely at random. One disadvantage is that mean imputation reduces the variance in the dataset. A small sketch of this approach is shown below.
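As a minimal sketch of this idea on our dataframe (mean for numerical columns, mode for categorical ones; the median could be swapped in just as easily):
# Create a temporary copy of the dataset
temp_df = df.copy()
# fill each numerical column with that column's mean
numeric_cols = temp_df.select_dtypes(include=[np.number]).columns
temp_df[numeric_cols] = temp_df[numeric_cols].fillna(temp_df[numeric_cols].mean())
# fill each categorical column with that column's mode (most frequent value)
categorical_cols = temp_df.select_dtypes(exclude=[np.number]).columns
for col in categorical_cols:
    temp_df[col] = temp_df[col].fillna(temp_df[col].mode()[0])
temp_df.head()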
This is an imputation technique that uses other variables in the observed data to replace missing values with predicted values from a regression model. Complete observations are used to generate the regression equation; the equation is then used to predict missing values for incomplete observations. In an iterative process, values for the missing variable are inserted and then all cases are used to predict the dependent variable. These steps are repeated until there is little difference between the predicted values from one step to the next, that is, until they converge. The major drawback of this method is that it reduces variability. Though we have not introduced regression yet, it's important to keep this in mind.
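Though we won't cover regression here, a minimal sketch of this iterative, regression-based approach is possible with scikit-learn's IterativeImputer. This assumes scikit-learn is installed (it isn't imported elsewhere in this task) and only uses the numerical columns:
# iterative (regression-based) imputation on the numerical columns only
from sklearn.experimental import enable_iterative_imputer  # required to unlock IterativeImputer
from sklearn.impute import IterativeImputer
numeric_df = df.select_dtypes(include=[np.number])
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_df = pd.DataFrame(imputer.fit_transform(numeric_df), columns=numeric_df.columns)
imputed_df.head()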
For k-nearest neighbour (kNN) imputation, the values are obtained using similarity-based methods that rely on distance metrics (Euclidean distance, Jaccard similarity, Minkowski norm, etc.). They can be used to predict both discrete and continuous attributes. kNN works by finding other observations that are most similar to the observation with the missing value. For example, if the observation is 'Female' and 'Asian', we find other similar observations and take the mean or mode of the missing variable from them. The main disadvantage of kNN imputation is that it becomes time-consuming on large datasets, because it searches through the whole dataset for similar instances.
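A minimal sketch using scikit-learn's KNNImputer (again assuming scikit-learn is installed; it works on numerical columns only, so categorical variables would have to be encoded first):
from sklearn.impute import KNNImputer
# impute each missing value from the 5 most similar observations (nan-aware Euclidean distance)
numeric_df = df.select_dtypes(include=[np.number])
knn_imputer = KNNImputer(n_neighbors=5)
knn_imputed_df = pd.DataFrame(knn_imputer.fit_transform(numeric_df), columns=numeric_df.columns)
knn_imputed_df.head()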
One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in-depth about each of these options.
Scaling means transforming your data so that it fits within a specific range, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbours (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.
For example, you might be looking at the prices of some products in both yen and US dollars. One US dollar is worth about 100 yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference of 1 yen as important as a difference of 1 US dollar! This clearly doesn't fit our intuitions about the world. With currency, you can convert between units. But what if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one metre).
By scaling your variables, you can help compare different variables on equal footing. To help solidify what scaling looks like, let's look at a made-up example. (Don't worry, we'll work with real data in just a second, this is just to help illustrate the point.)
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000)
# mix-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
Notice that the shape of the data doesn't change, but that instead of ranging from 0 to 8ish, it now ranges from 0 to 1.
# read in all our data
countries = pd.read_csv(r"C:\Users\byamu\Dropbox\Tony Byamungu-28775\Data Analytics and Exploration\Task 14\countries.csv")
countries.head()
# scale the 'IT.CEL.SETS.P2' column from 0 to 1
scaled_data = minmax_scaling(countries['IT.CEL.SETS.P2'], columns = [0])
# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(countries['IT.CEL.SETS.P2'], ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
You can see that scaling changed the range of the plot dramatically (but not the shape of the data: most values are small, while a few are very large).
Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.
Normal distribution: also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.
In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
# normalize the 'NE.EXP.GNFS.ZS' column with a Box-Cox transformation
# (Box-Cox requires strictly positive values)
positive_values = countries['NE.EXP.GNFS.ZS']
normalized_values = stats.boxcox(positive_values)[0]
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_values, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_values, ax=ax[1])
ax[1].set_title("Normalized data")
# Read in Pakistan.csv and display the first 5 rows
pakistan = pd.read_csv(r"C:\Users\byamu\Dropbox\Tony Byamungu-28775\Data Analytics and Exploration\Task 14\Pakistan.csv", encoding='latin')
pakistan.head()
# get the number of missing data points per column
total_missing = pakistan.isnull().sum()
# look at the # of missing points in the first ten columns
total_missing[0:10]
Write a note on the reason you think we have missing data in the top three columns: Islamic date, Holiday type and Time. Remember to classify them according to the three categories we analysed in the PDF.
Time: I think the values are missing in this column because there was no way to accurately record the time. Perhaps the person collecting the data couldn't find a trusted source to record it accurately, and there is no reliable way to impute or replace the missing values, so this is Missing Not at Random (MNAR).
Holiday type: I think we have missing values in this column because not every day is a holiday, so it makes sense that some rows have a missing value here simply because that day was not a holiday. The missingness has nothing to do with the hypothetical value itself or with the values of the other data, so this is Missing Completely at Random (MCAR).
Islamic date: The data are Missing at Random (MAR), because the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.
For each of the following examples, decide whether scaling or normalization makes more sense.
Here, because we are going to use a machine learning algorithm to build a linear regression model, and the time students spend studying is not normally distributed, normalization is the way to go.
Here, because we just need to change the range and put the former and the latter observations on the same scale, we need to use scaling.
# From the countries dataset scale the "EG.ELC.ACCS.ZS" column and visualize it as well
eg = countries['EG.ELC.ACCS.ZS']
eg_scaled = minmax_scaling(eg, columns=[0])
plt.subplot(1,2,1)
plt.hist(eg)
plt.subplot(1,2,2)
plt.hist(eg_scaled)
# From the countries dataset normalize the "SP.DYN.CBRT.IN" column
sp = countries['SP.DYN.CBRT.IN']
sp_normalized = stats.boxcox(sp)[0]
plt.figure(figsize=(8,8))
plt.subplot(121)
plt.hist(sp)
plt.subplot(122)
plt.hist(sp_normalized)